Usually with these blog posts, I find a problem, I fight with it for a bit, and then I solve the problem. But this post has been sitting half-written in my publishing queue since May (!) this year, and I have entirely failed to solve this issue. So I'm admitting defeat, and publishing this anyway because maybe one of you knows the answer. Or at least it might serve as a warning...
My issue is that I've been working through some really odd and annoying Solr issues which only manifest in Docker on one laptop. I'm really not sure if these are issues that others might see, or if this is a problem that's entirely down to this laptop's setup. But they're definitely a problem...
Edited to add: Since writing this, I have found a reliable work-around, by running ZooKeeper separately inside the Solr container. If you're seeing this issue, you should check that out.
But a week or so ago, the instance I had set up stopped working for me. I could run
up.ps1
but it would respond that it had timed out waiting for the CM role to become available:
I tried a lot of the classic "fix my Sitecore Docker instance" tricks:
docker-compose down
and then up again, rebooting the computer, running
clean.ps1
to get rid of the state, running
up.ps1
again. Running
init.ps1
again. But none of those solved the problem.
Looking at the state of the setup in more detail, the root problem was Solr. It was starting and appeared to be running, but the logs showed two errors. When it first starts, it complains about not being able to see Zookeeper:
2022-04-28 10:39:01.496 INFO (main) [ ] o.e.j.u.log Logging initialized @2211ms to org.eclipse.jetty.util.log.Slf4jLog 2022-04-28 10:39:01.668 INFO (main) [ ] o.e.j.s.Server jetty-9.4.34.v20201102; built: 2020-11-02T14:15:39.302Z; git: e46af88704a893fc12cb0e3bf46e2c7b48a009e7; jvm 1.8.0_252-b09 2022-04-28 10:39:01.699 INFO (main) [ ] o.e.j.d.p.ScanningAppProvider Deployment monitor [file:///C:/solr/server/contexts/] at interval 0 2022-04-28 10:39:02.856 INFO (main) [ ] o.e.j.w.StandardDescriptorProcessor NO JSP Support for /solr, did not find org.apache.jasper.servlet.JspServlet 2022-04-28 10:39:02.871 INFO (main) [ ] o.e.j.s.session DefaultSessionIdManager workerName=node0 2022-04-28 10:39:02.871 INFO (main) [ ] o.e.j.s.session No SessionScavenger set, using defaults 2022-04-28 10:39:02.871 INFO (main) [ ] o.e.j.s.session node0 Scavenging every 600000ms 2022-04-28 10:39:02.965 INFO (main) [ ] o.a.s.s.SolrDispatchFilter Using logger factory org.apache.logging.slf4j.Log4jLoggerFactory 2022-04-28 10:39:02.965 INFO (main) [ ] o.a.s.s.SolrDispatchFilter ___ _ Welcome to Apache Solr™ version 8.8.2 2022-04-28 10:39:02.965 INFO (main) [ ] o.a.s.s.SolrDispatchFilter / __| ___| |_ _ Starting in cloud mode on port 8983 2022-04-28 10:39:02.965 INFO (main) [ ] o.a.s.s.SolrDispatchFilter \__ \/ _ \ | '_| Install dir: C:\solr 2022-04-28 10:39:02.965 INFO (main) [ ] o.a.s.s.SolrDispatchFilter |___/\___/_|_| Start time: 2022-04-28T10:39:02.965Z 2022-04-28 10:39:02.981 INFO (main) [ ] o.a.s.c.SolrPaths Using system property solr.solr.home: c:\\data 2022-04-28 10:39:02.981 INFO (main) [ ] o.a.s.c.SolrXmlConfig Loading container configuration from c:\data\solr.xml 2022-04-28 10:39:03.075 INFO (main) [ ] o.a.s.c.SolrXmlConfig MBean server found: com.sun.jmx.mbeanserver.JmxMBeanServer@47542153, but no JMX reporters were configured - adding default JMX reporter. 2022-04-28 10:39:03.528 INFO (main) [ ] o.a.s.h.c.HttpShardHandlerFactory Host whitelist initialized: WhitelistHostChecker [whitelistHosts=null, whitelistHostCheckingEnabled=true] 2022-04-28 10:39:03.700 WARN (main) [ ] o.e.j.u.s.S.config Trusting all certificates configured for Client@5b7ea70d[provider=null,keyStore=null,trustStore=null] 2022-04-28 10:39:03.700 WARN (main) [ ] o.e.j.u.s.S.config No Client EndPointIdentificationAlgorithm configured for Client@5b7ea70d[provider=null,keyStore=null,trustStore=null] 2022-04-28 10:39:03.856 WARN (main) [ ] o.e.j.u.s.S.config Trusting all certificates configured for Client@10c8f62[provider=null,keyStore=null,trustStore=null] 2022-04-28 10:39:03.856 WARN (main) [ ] o.e.j.u.s.S.config No Client EndPointIdentificationAlgorithm configured for Client@10c8f62[provider=null,keyStore=null,trustStore=null] 2022-04-28 10:39:03.918 INFO (main) [ ] o.a.s.c.SolrZkServerProps Reading configuration from: c:\data\zoo.cfg 2022-04-28 10:39:03.918 INFO (main) [ ] o.a.s.c.SolrZkServer STARTING EMBEDDED STANDALONE ZOOKEEPER SERVER at port 9983 2022-04-28 10:39:03.918 WARN (main) [ ] o.a.s.c.SolrZkServer Embedded Zookeeper is not recommended in production environments. See Reference Guide for details. 2022-04-28 10:39:04.418 INFO (main) [ ] o.a.s.c.ZkContainer Zookeeper client=localhost:9983 2022-04-28 10:39:04.996 WARN (Thread-12) [ ] o.a.z.s.ServerCnxnFactory maxCnxns is not configured, using default value 0. 2022-04-28 10:39:05.012 INFO (main) [ ] o.a.s.c.c.ConnectionManager Waiting for client to connect to ZooKeeper 2022-04-28 10:39:05.028 WARN (NIOWorkerThread-1) [ ] o.a.z.s.NIOServerCnxn Close of session 0x0 => java.io.IOException: ZooKeeperServer not running at org.apache.zookeeper.server.NIOServerCnxn.readLength(NIOServerCnxn.java:544) java.io.IOException: ZooKeeperServer not running at org.apache.zookeeper.server.NIOServerCnxn.readLength(NIOServerCnxn.java:544) ~[?:?] at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:332) ~[?:?] at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522) ~[?:?] at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154) ~[?:?] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_252] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_252] at java.lang.Thread.run(Thread.java:748) [?:1.8.0_252] 2022-04-28 10:39:05.028 WARN (NIOWorkerThread-1) [ ] o.a.z.s.WorkerService Unexpected exception => java.lang.NullPointerException at org.apache.zookeeper.server.ZooKeeperServer.removeCnxn(ZooKeeperServer.java:285) java.lang.NullPointerException: null at org.apache.zookeeper.server.ZooKeeperServer.removeCnxn(ZooKeeperServer.java:285) ~[?:?] at org.apache.zookeeper.server.NIOServerCnxn.close(NIOServerCnxn.java:593) ~[?:?] at org.apache.zookeeper.server.NIOServerCnxn.close(NIOServerCnxn.java:583) ~[?:?] at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:374) ~[?:?] at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522) ~[?:?] at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154) ~[?:?] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_252] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_252] at java.lang.Thread.run(Thread.java:748) [?:1.8.0_252]
And then later in the logs, I got errors like this:
2022-04-27 14:49:43.626 INFO (main) [ ] o.e.j.s.AbstractConnector Started ServerConnector@1329eff{HTTP/1.1, (http/1.1, h2c)}{0.0.0.0:8983} 2022-04-27 14:49:43.626 INFO (main) [ ] o.e.j.s.Server Started @38620ms 2022-04-27 14:50:05.641 ERROR (qtp572868060-18) [ ] o.a.s.s.SolrDispatchFilter Error processing the request. CoreContainer is either not initialized or shutting down. 2022-04-27 14:50:05.725 WARN (qtp572868060-18) [ ] o.e.j.s.HttpChannel /solr/admin/cores => javax.servlet.ServletException: javax.servlet.UnavailableException: Error processing the request. CoreContainer is either not initialized or shutting down. at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:162) javax.servlet.ServletException: javax.servlet.UnavailableException: Error processing the request. CoreContainer is either not initialized or shutting down. at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:162) ~[jetty-server-9.4.34.v20201102.jar:9.4.34.v20201102] at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127) ~[jetty-server-9.4.34.v20201102.jar:9.4.34.v20201102] at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:322) ~[jetty-rewrite-9.4.34.v20201102.jar:9.4.34.v20201102]
And if you try to browse to the Solr instance now you get this friendly error:
And that explains the Sitecore instance failing to start: The Docker error is about the
solr-init
container being in an error state, which makes perfect sense if Solr itself isn't working.
I could use
dotnet new
to create a new instance of the example project. This would start the first time I ran
up.ps1
but would then show the same error as above on the second and subsequent attempts. I could use notepad to edit the
.env
file for one of these projects and change the
COMPOSE_PROJECT_NAME
variable. Once saved, this would allow the site to start once, and then subsequent attempts would fail with the same error. And I could edit the
docker-compose.yml
file(s) to tweak the settings (say, exposing a new port) and save that change. Again, it would start once and then subsequent attempts would fail. Rebooting my computer would also allow things to start once.
That suggests to me that something to do with building the image has an affect on this issue. Maybe something to do with there being state data which doesn't exist on the first start of a new image, but does exist (and causes an issue) on the subsequent starts.
So I could use these as work-arounds - but they all involved changing the project I was working on, or rebooting - which seemed heavy handed. At some point I'd accidentally commit one of my hacky changes to source control and risk messing up other team members' work...
And the overall issue made me very confused.
I wondered if Zookeeper was actually broken, when I was seeing the errors in the Solr logs. So I tried a test where I allowed
up.ps1
to run to the error shown above, and then connected a Powershell session into the Solr container with
docker exec -it sitecore_jss_poc3_solr_1 powershell.exe
. Once in there I could run a bit of script to call Zookeeper API to ask it if it was alive:
$client = New-Object System.Net.Sockets.TcpClient $client.Connect("localhost", 9983) $ns = [System.Net.Sockets.NetworkStream]$client.GetStream() $sendBytes = [System.Text.Encoding]::ASCII.GetBytes("ruok") $ns.Write($sendBytes, 0, $sendBytes.Length) $buffer = New-Object 'byte[]' 10 $bytesRead = $ns.Read($buffer, 0, 10) $receivedBytes = New-Object 'byte[]' $bytesRead [System.Array]::Copy($buffer, $receivedBytes, $bytesRead) $result= [System.Text.Encoding]::ASCII.GetString($receivedBytes) Write-Host "Response: $result"
Confusingly, that returned success:
So whatever is preventing Solr from seeing Zookeeper, Zookeeper itself is running...
At this point I hit Google pretty hard, spoke to others in my company, as well as asking about this in the Docker channel in Sitecore Slack. I found a small number of others who'd seen similar issue, and some muttering about how the current version of Docker might be to blame for some issues. (I had upgraded to
4.7.1 (77678)
recently, so that was plausible)
Reverting to Docker Desktop v4.6 didn't seem to resolve the issue. Stopping all the services for my office VPN didn't resolve the issue. One suggestion was to run
netsh winsock reset
in an elevated session and reboot. But that seemed to have no more effect that rebooting would have had on its own.
To revert back to "latest" Docker Desktop after these tests I did the uninstall, and noticed a few folders left lying about which seemed Docker-related. So I manually tidied all the files in
docker
and
docker desktop
folders under
c:\programdata\
and in the children of
%appdata%
on my machine. But after reinstalling that didn't seem to make a difference.
Next on my list to consider was, could this be a docker networking issue? I tried running
docker network ls
to see what existed:
PS C:\Users\jeremy.davis\source\repos\test> docker network ls NETWORK ID NAME DRIVER SCOPE a8aafbd09e1f Default Switch ics local 2190f8e7c643 nat nat local e8c1255ef410 none null local
The "Default Switch" entry wasn't there on a colleague's machine. And it went away when I speculatively ran the network prune command:
PS C:\Users\jeremy.davis\source\repos\test> docker network prune WARNING! This will remove all custom networks not used by at least one container. Are you sure you want to continue? [y/N] y Deleted Networks: Default Switch PS C:\Users\jeremy.davis\source\repos\test> docker network ls NETWORK ID NAME DRIVER SCOPE 2190f8e7c643 nat nat local e8c1255ef410 none null local
After doing this, my Solr test compose file would start. Not 100% of the time. But more than 50% of the time. Which is a significant step forwards, but clearly still not right somehow. And the main Sitecore compose project would start with the same reliability too.
Note:
Take care to ensure you've run
docker-compose down
before pruning - you'll get odd errors about not being able to start the containers if you don't:
If you get this "Network
But I've not managed to get it any better than this. I still get that error about not finding Zookeeper sometimes when starting the Solr container.
So I'm putting this here in case it makes more sense to someone else, or in case it helps someone else who's stuck. But it's probably neither the right solution, or a good one. Just a hacky workaround which has helped me.
My suspicion is that this is down to something specific in the build or state of my work laptop, since I don't see these issues on my personal laptop. Maybe one day I'll work out what that is. More likely I'll end up reformatting and it will magically go away after that. But for the moment I have the
docker network prune
work around to mostly bypass the error here and get on with some work...
To be honest, there are days when I don't really like computers... ;-)
Updated to add: After I posted this Rob Ahnemann posted his own take on how to deal with this issue. If you've hit a problem like this, you may want to check that out too, for an alternate take. There's also now a post from Gert Gullentops about his take on this issue too.↑ Back to top